Let's load the data and confirm that everything is OK.
# Access the package ggplot2
library(ggplot2)
# Load RATSL
library(readr)
RATSL <- read.csv("data/RATSL.csv")
# Re-convert the columns to factors, as saving to CSV dropped the formatting
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
RATSL$WD <- as.character(RATSL$WD)
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
head(RATSL)
## ID Group WD Weight Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
From the DataCamp exercise we can find the description of the data:
…nutrition study conducted in three groups of rats. The groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.
# Draw the plot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
Figure 1: Individual response profiles by diet group for the RATS data
From the plot we can see that all the rats gained weight during the study. In group 1 there is one rat that had a lower weight to begin with, and there is a similar case in group 3.
In group 2 there is one clear outlier, a rat whose weight was much higher at the start than that of the other rats in the group.
The slope of the lines seems to be smallest in group 1, which suggests the rats in that group gained the least weight during the study.
Tracking is visible on average in all groups, even if some deviations can be seen, most clearly in group 2 (tracking = rats with higher weight values at the beginning tend to have higher values throughout the study).
The variation between the rats does not seem to change much during the study.
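The claim about the variation can be checked numerically. A minimal sketch (assuming the dplyr package and the RATSL data frame from above):

```r
library(dplyr)
# Within-group standard deviation of Weight at each measurement time;
# roughly constant values would support the claim above
RATSL %>%
  group_by(Group, Time) %>%
  summarise(sd_weight = sd(Weight)) %>%
  ungroup()
```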
library(tidyr)
library(dplyr)
# Standardise the variable Weight within each time point
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdWeight = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
# Glimpse the data
glimpse(RATSL)
## Observations: 176
## Variables: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1...
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 44...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8...
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8...
# Draw the plot
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(name = "standardised RATS")
Figure 2: Individual response profiles by diet group for the standardised RATS weight data
As with the MABS BPRS data, tracking can be seen more clearly from the standardised data.
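Written out in the same notation as the models below in this report, the within-time standardisation computed above is, for rat \(i\) at time \(t_j\), \[stdWeight_{ij} = \frac{Weight_{ij} - \overline{Weight}_{\cdot j}}{s_j},\] where \(\overline{Weight}_{\cdot j}\) and \(s_j\) are the mean and standard deviation of the weights of all rats at time \(t_j\).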
# Number of measurement times (baseline day 1 included)
n <- RATSL$Time %>% unique() %>% length()
# Summary data with mean and standard deviation of Weight by Group and Time
# (note: the column is named se, but it holds the plain sd, not sd/sqrt(n))
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se= sd(Weight) ) %>%
ungroup()
# Glimpse the data
glimpse(RATSS)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se <dbl> 15.22158, 13.09307, 11.47591, 13.60081, 11.05748, 11.783...
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.35)) +
scale_y_continuous(name = "mean(RATS) +/- se(RATS)")
From the mean profiles plot we can confirm that the average weight of the rats rose during the study. There were some deviations in the growth: the rats lost weight between the first and second observation in group 3, and between observations six and seven in groups 1 and 2. The average weight of the rats in group 1 was much lower than in the other two groups.
Let's draw the alternative to the mean response graph: side-by-side box plots of the observations at each time point.
# Zero-pad the day labels, as "WD1" and "WD8" would otherwise sort in the wrong order
RATSLP<-RATSL
RATSLP$WD[RATSLP$WD == "WD1"] <- "WD01"
RATSLP$WD[RATSLP$WD == "WD8"] <- "WD08"
# Draw boxplots of Weight at each measurement day, coloured by Group
ggplot(RATSLP, aes(x = WD, y = Weight, col = Group)) +
geom_boxplot() +
xlab("Week")
Figure 3: Boxplots of the RATS data by measurement day
We can try spotting the outliers by drawing boxplots for each of the study groups. Let's create summary data by Group and ID, with the mean as the summary variable (ignoring the baseline measurement on day 1).
RATSL8S <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
# Glimpse the data
glimpse(RATSL8S)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, ...
# Draw a boxplot of the mean summary measure versus group
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(RATS), 10 observations")
From the boxplots we can see that there is an outlier in each group, three in total. Let’s remove the outliers and replot.
# Create new data by filtering out the outliers, and adjust the ggplot code to draw the plot again with the new data
RATSL8S1 <- filter(RATSL8S, (Group == 1 & mean > 250) | (Group == 2 & mean < 590) | (Group == 3 & mean > 500))
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(RATS), 10 observations")
The boxplots changed quite dramatically, especially for group 2. There is a lot less variance in the averages within the study groups, and each group appears to have its own distinct average weight level.
Now, in my opinion there is little sense in testing the groups with a t-test in the same way as in MABS for the BPRS data (an F-test would be needed, as there are now three groups). None of the plots indicated a lack of difference between the diet groups; the difference between them is already clear from the plots.
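Should one nonetheless want a formal test, the three-group analogue of the MABS t-test on the summary measure would be a one-way ANOVA (F-test). A minimal sketch, assuming the outlier-free summary data RATSL8S1 from above:

```r
# One-way ANOVA of the mean summary measure on diet group
rats_aov <- aov(mean ~ Group, data = RATSL8S1)
summary(rats_aov)
```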
I could not find more information about how the rats were divided into groups by diet, other than that “they’re on different diets”. There seems to be no control group, or at least I do not know what the different groups represent. More info about the data can be found here.
Let's load the data and confirm that everything is OK.
# Access the package ggplot2
library(ggplot2)
# Load BPRSL
library(readr)
BPRSL <- read.csv("data/BPRSL.csv")
BPRS <- read.csv("data/BPRS.csv")
# Re-convert the columns to factors, as saving to CSV dropped the formatting
BPRSL$subject <- factor(BPRSL$subject)
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$weeks <- as.character(BPRSL$weeks)
BPRS$subject <- factor(BPRS$subject)
BPRS$treatment <- factor(BPRS$treatment)
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
head(BPRSL)
## treatment subject weeks bprs week
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
## 4 1 4 week0 55 0
## 5 1 5 week0 72 0
## 6 1 6 week0 48 0
From the Datacamp exercise we know the following about the data:
in which 40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.
Let’s draw a scatter plot of the data, with a fitted regression line for each treatment group.
# initialize plot with data and aesthetic mapping
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, col=treatment))
# define the visualization type (points)
p2 <- p1 + geom_point(size = 3, alpha = .3)
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
#draw the plot
p4 <- p3 + theme_minimal()
p4
Figure 4: Plot of bprs against time, ignoring the repeated-measures structure of the data but identifying the treatment group of each observation.
From the fitted regression lines we can see that bprs decreases over time in both groups; that is, on average the patients’ symptoms were rated less severe towards the end of the study.
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line(aes(col=treatment)) +
scale_linetype_manual(values = rep(1:10, times=4)) +
# facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
Figure 5: Individual bprs profiles by treatment group
There does not seem to be much difference between the two treatment groups. Let’s plot the pairwise scatterplots.
pairs(BPRS[3:11])
Figure 6: Scatterplot matrix of repeated measures in the BPRS data.
For drawing the pairs plot we used the wide form of the data. We can observe a linear pattern (dependence) in some of the pairs, particularly in the lower right corner.
Let’s fit a linear model, ignoring the repeated-measures structure of the data, using bprs as the response and week and treatment as explanatory variables.
# create a regression model BPRS_reg
# bprs ~ week + treatment
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
From the regression we can see that treatment2 is not statistically significant at any conventional level, conditional on time (week). This means we cannot reject the null hypothesis that its coefficient is zero, i.e. that it has no effect on bprs. The coefficient for week is negative and statistically significant at the 99.9% confidence level. We can also test whether the week and treatment dummy coefficients are jointly zero.
#create a dummy variable, which gets the value '1' when the subject belongs to
#treatment group 2
d1 <- as.numeric(BPRSL$treatment)-1
BPRS_reg <- lm(bprs ~ week + d1, data = BPRSL)
library(car)
ss_results <- linearHypothesis(BPRS_reg, c("week = 0", "d1 = 0"))
print(ss_results)
## Linear hypothesis test
##
## Hypothesis:
## week = 0
## d1 = 0
##
## Model 1: restricted model
## Model 2: bprs ~ week + d1
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 359 66985
## 2 357 54584 2 12401 40.553 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We find evidence that the coefficients are not jointly zero (F = 40.55, p < 2.2e-16).
However, as is stated in MABS, the model assumes independence of the repeated measures, and this assumption is highly unlikely to hold for longitudinal data.
Now the regression model is \[y_{ij} \sim (\beta_0+u_i)+\beta_1t_j +\beta_2D_i+\epsilon_{ij}\]
# access library lme4
library(lme4)
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
VC<-VarCorr(BPRS_ref)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
The variance for the intercept term is 47.4, which seems quite large, even if it is perhaps a little smaller than the variation found in the MABS RATS data once the scale of the response variable is taken into account.
If we look at the standard deviation, 6.9, which is on the same scale as bprs, the variation of the intercept does not seem that big. Unfortunately I am not familiar enough with the analysis of longitudinal data to judge whether the variation is big enough to cause concern.
The fixed-effect estimates in the linear mixed model are identical, to four decimal places, to those in the earlier regression where independence was assumed.
The standard error for week is now somewhat smaller (0.2084) than in the earlier regression (0.2524) (MABS: assuming independence will lead to the standard error of a within-subject covariate such as time being larger than it should be).
The standard error for treatment is now smaller (1.0761) than in the earlier regression (1.3034). This is the opposite of the situation in MABS with the RATS data.
Now the model is \[y_{ij} \sim (\beta_0+u_i)+(\beta_1+v_i)t_j +\beta_2D_i+\epsilon_{ij}\]
# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8202 8.0511
## week 0.9608 0.9802 -0.51
## Residual 97.4307 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variance for the intercept seems to have grown, perhaps because the model now also includes a random slope. The estimates for the fixed effects are again the same; the std. error for week is a little bigger and for treatment a little smaller.
The likelihood ratio test of the random intercept model against the random intercept and slope model gives a chi-squared statistic of 7.27 with 2 degrees of freedom (DF). The p-value 0.02636 is significant at the 95% level, so we may reject the null. The sample size is on the small side though, so I would prefer a higher confidence level. Rejecting the null hypothesis means the random intercept and slope model gives a better fit. According to MABS, the p-value obtained here is not without problems, because the null hypothesis places a variance parameter on the boundary of its parameter space; the solution is to divide the p-value by two (more information in MABS part 6, page 27). So the final p-value is 0.01318 and we end up rejecting the null.
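The halving can be done directly from the anova() output. A minimal sketch, assuming the BPRS_ref and BPRS_ref1 models from above:

```r
# Extract the naive LRT p-value and halve it (MABS boundary correction)
lrt <- anova(BPRS_ref1, BPRS_ref)
p_half <- lrt[["Pr(>Chisq)"]][2] / 2
p_half
```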
The model with an added group × time interaction term is \[y_{ij} \sim (\beta_0+u_i)+(\beta_1+v_i)t_j +\beta_2D_i+\beta_3(D_i\times t_j)+\epsilon_{ij}\]
# create a random intercept and random slope model with a week x treatment interaction
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0767 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 65.0016 8.0624
## week 0.9688 0.9843 -0.51
## Residual 96.4699 9.8219
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2522 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficients for the fixed effects have changed somewhat, but are still quite close to the first regression. Treatment group 2 now has a negative coefficient of -2.2911 (group 1 is the reference level), which means a lower predicted bprs at week 0 for group 2; the positive interaction coefficient (0.7158) means that bprs declines more slowly over the weeks in group 2.
Let’s add the fitted values into BPRSL as a column.
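The code chunk that produced the figure below did not survive in this knitted output. A minimal sketch of what it likely looked like, assuming the dplyr package and the BPRS_ref2 model from above (the column name Fitted is my own choice):

```r
# Add the fitted values from the interaction model as a new column
BPRSL <- BPRSL %>% mutate(Fitted = fitted(BPRS_ref2))
# Plot the fitted individual profiles, mirroring the observed-profile plot above
ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject)) +
  geom_line(aes(col = treatment)) +
  scale_linetype_manual(values = rep(1:10, times = 4)) +
  theme(legend.position = "none") +
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
```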
## Warning: Removed 1 rows containing missing values (geom_path).
Figure 7: Fitted growth rate profiles from the interaction model and observed growth rate profiles
The interaction model does not seem to fit the observed data very well, as the fitted profiles look quite different from the observed ones.